LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl

Authors

  • Szymon Roziewski
  • Wojciech Stokowiec
Abstract

The web contains an immense amount of data: hundreds of billions of words are waiting to be extracted and used for language research. In this work we introduce our tool, LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily construct web-scale corpora from the Common Crawl Archive, a petabyte-scale open repository of web crawl information. Three use cases are presented: filtering Polish websites, building N-gram corpora, and training a continuous skip-gram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust the target language and the N-gram ranks. Special effort has been put into high computing efficiency by applying highly concurrent multitasking. We make our tool publicly available to enrich NLP resources. We strongly believe that our work will help to facilitate NLP research, especially in under-resourced languages, where the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods such as deep neural networks.
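As a rough illustration of the second use case (building N-gram corpora with an adjustable rank), the following minimal Python sketch counts N-grams over lines of text. It is not the LanguageCrawl implementation: the function names `ngrams` and `count_ngrams`, the `max_order` parameter, the lowercase whitespace tokenizer, and the two sample sentences are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous window of n tokens from a token list."""
    # When the line has fewer than n tokens the range is empty.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(lines: Iterable[str], max_order: int = 3) -> Counter:
    """Count all N-grams of order 1..max_order over the given lines."""
    counts: Counter = Counter()
    for line in lines:
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        tokens = line.lower().split()
        for n in range(1, max_order + 1):
            counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample input standing in for filtered Polish web text.
sample = ["Ala ma kota", "Ala ma psa"]
counts = count_ngrams(sample, max_order=2)
```

Here `counts` maps token tuples to frequencies, so unigrams and bigrams share one table and `max_order` plays the role of the adjustable N-gram rank. A web-scale run would need streaming input and parallel workers, which this sketch leaves out.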

Related resources

N-gram Counts and Language Models from the Common Crawl

We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the c...

Edinburgh's Phrase-based Machine Translation Systems for WMT-14

This paper describes the University of Edinburgh’s (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters for translations out of English, ii) us...

Exploring Rhetorical-Discursive Moves in Hassan Rouhani’s Inaugural Speech: A Eulogy for Moderation

Before a president practically begins his four-year term of office in Iran, a formal inaugural ceremony is held in the parliament. Being attended by national dignitaries and representatives from other countries, the inauguration of Iran's seventh president, Hasan Rouhani, was spectacular in several respects. The current study aimed at investigating the generic structure and rhetorical moves tha...

Conceptual Metaphoric Language Use in Structuring Political Discourse in Iran-West Relations: A CDA Perspective

The present study was carried out with the purpose of examining the role of metaphorical language in the critical discourse analysis (CDA) of political texts based on a modern framework postulated by Kövecses (2015). The corpus of the study consisted of thirty-thousand words chosen as a textual sample to see which source conceptual domains are used and what generic/discursive attributes emerge ...

Publication date: 2016